This document drafts what we need to do to integrate the GSEA pathway analysis with the differential expression analysis we did with DESeq2. We propose the following pipeline:
1. Fit a model with a formula of the type Count_ij ~ Treat_i + Cell_j + Interaction_ij.
2. For every gene, test whether there is a treatment effect on the cell-specific expression.
3. For every gene, summarize the TPM level of each treatment, cell, and interaction.
4. For every cell, take the genes that are differentially expressed and are among the top \(K\) most expressed genes for both treatments (the aim is to avoid cases where the log2FC value is extreme because of a very low number of reads in either treatment).
5. Using the t.stat values as a signal-to-noise metric, do a pathway analysis with GSEA.
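To make the final GSEA step concrete, the following is a minimal sketch of the weighted running-sum enrichment score computed on a gene list ranked by a per-gene statistic. It is an illustration only, not the GSEA software itself. The function name `enrichment_score`, the input layout, and the weight exponent `p=1.0` are assumptions made for this example, and no permutation-based significance is computed.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked   : list of (gene, statistic) pairs, e.g. the per-gene t statistic
               (hypothetical input layout for this sketch)
    gene_set : set of gene identifiers belonging to the pathway
    p        : exponent applied to |statistic| when weighting hits
    """
    # Rank genes from most positive to most negative statistic.
    ranked = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    hits = [gene in gene_set for gene, _ in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    # Normalizing constant: total weight of the genes in the set.
    weight_total = sum(abs(stat) ** p for (_, stat), hit in zip(ranked, hits) if hit)
    if weight_total == 0:
        raise ValueError("all statistics in the gene set are zero")

    running, best = 0.0, 0.0
    for (_, stat), hit in zip(ranked, hits):
        if hit:
            running += abs(stat) ** p / weight_total  # step up on a hit
        else:
            running -= 1.0 / n_miss                   # step down on a miss
        if abs(running) > abs(best):
            best = running                            # keep the largest deviation from zero
    return best


if __name__ == "__main__":
    toy = [("a", 3.0), ("b", 2.0), ("c", -1.0), ("d", -2.0)]
    print(enrichment_score(toy, {"a", "b"}))
```

A positive score means the set is concentrated at the top of the ranking (up with treatment), and a negative score means it is concentrated at the bottom.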
TPM may not be appropriate to summarize the signal. I am going to use the rlog matrix, which is similar to the log2 scale of count data but minimizes differences between small counts. This transformation is expected to reduce the extreme nothing-to-all log2FC values.
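The effect of low counts on log2FC can be illustrated with a small sketch. Note that this is not the rlog algorithm: it uses a simple pseudocount as a stand-in, only to show why damping differences between small counts pulls extreme fold changes toward zero while leaving well-supported ones nearly unchanged. The function name `log2fc` and the pseudocount value are assumptions for this example.

```python
import math


def log2fc(treated, untreated, pseudocount=0.0):
    """log2 fold change after adding a pseudocount to both counts.

    A pseudocount is a crude stand-in for the shrinkage that rlog
    performs; it is used here for illustration only.
    """
    numerator = treated + pseudocount
    denominator = untreated + pseudocount
    if numerator <= 0 or denominator <= 0:
        raise ValueError("counts plus pseudocount must be positive")
    return math.log2(numerator / denominator)


if __name__ == "__main__":
    # Same 8-fold ratio, very different read support.
    for treated, untreated in [(8, 1), (8000, 1000)]:
        raw = log2fc(treated, untreated)
        damped = log2fc(treated, untreated, pseudocount=8)
        print(treated, untreated, round(raw, 2), round(damped, 2))
```

Both pairs have the same raw log2FC, but only the low-count pair shrinks substantially once the counts are damped.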
We further explore the genes in the red square; these are the genes that go from being unexpressed without treatment to being highly expressed with treatment. In total, there are:
| batch | cell | genes_in_square |
|---|---|---|
| new | dRdZ | 6 |
| new | NOKS | 2 |
| old | NOKS | 1 |
The figure above shows that, as expected, those genes are not expressed when no treatment is applied, and very likely there is no signal if we look at the tracks.
We filtered the genes for the GSEA analysis, keeping only the genes inside the orange lines that are differentially expressed (defined as genes with adjusted p-value \(\leq 0.01\)):
| batch | cell | total_genes | diff_genes |
|---|---|---|---|
| old | EBV | 16057 | 2398 |
| old | NOKS | 16057 | 4422 |
| new | EBV | 15876 | 1707 |
| new | NOKS | 15876 | 2575 |
| new | dRdZ | 15876 | 2176 |
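A filter of this kind can be sketched as follows. The sketch uses a top-\(K\) rule on the expression summary of each treatment in place of the orange lines, which is an assumption, since the exact bounds drawn in the figure are not reproduced here. The function name `select_genes` and the dictionary inputs are hypothetical, and genes with a missing adjusted p-value are treated as not differentially expressed.

```python
def select_genes(padj, expr_untreated, expr_treated, k, alpha=0.01):
    """Return genes that are differentially expressed and well expressed.

    padj           : dict gene -> adjusted p-value (None when missing)
    expr_untreated : dict gene -> expression summary without treatment
    expr_treated   : dict gene -> expression summary with treatment
    k              : number of top expressed genes kept per treatment
    alpha          : adjusted p-value cutoff (0.01 in the text)
    """
    def top_k(expr):
        # The k genes with the highest expression summary.
        return set(sorted(expr, key=expr.get, reverse=True)[:k])

    # Genes must be among the top k in BOTH treatments.
    expressed = top_k(expr_untreated) & top_k(expr_treated)

    return sorted(
        gene for gene in expressed
        if padj.get(gene) is not None and padj[gene] <= alpha
    )


if __name__ == "__main__":
    padj = {"g1": 0.001, "g2": 0.5, "g3": 0.005, "g4": None}
    untreated = {"g1": 10, "g2": 9, "g3": 1, "g4": 8}
    treated = {"g1": 12, "g2": 11, "g3": 10, "g4": 9}
    print(select_genes(padj, untreated, treated, k=3))
```

In the toy data, `g3` passes the significance cutoff but is dropped at `k=3` because it is barely expressed without treatment, which is the kind of gene the filter is meant to exclude.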